Efficient Approximate Dictionary Matching
Abstract
Named entity recognition (NER) systems are important for extracting useful information from unstructured data sources. Large domain dictionaries are known to improve the extraction performance of NER. Unstructured text usually contains entity mentions that differ from their standard dictionary form, so approximate matching is needed to identify the correct dictionary entity for such variants. This is a challenging problem, as every entity in the dictionary is a candidate match for the variant. In this paper, we propose a novel approach for efficient approximate dictionary matching. The key idea is to compare a given query only against a set of the most likely candidate matches from the dictionary, achieving a substantial reduction in the number of matching operations. To enable this, the proposed approach first clusters similar entities and then represents each cluster with a profile matrix, which stores the probability of a particular character occurring at a specific position in the entity string. The dictionary is thus represented by a set of profile matrices whose number is much smaller than the number of entities. A given query entity is first matched against the profiles, and the clusters corresponding to the top-K best-scoring profiles are selected to obtain a list of the most likely matching candidates. The query is then compared with each candidate entity, and an approximate match is declared if the edit distance between the query and the candidate is within an acceptable threshold. We have rigorously evaluated our approach on several publicly available datasets. The proposed algorithm outperforms alternative approaches in detecting approximately matching entities for a given query while using far fewer comparison operations.
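The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the clusters are assumed to be given (the abstract does not say how they are formed), the profile score is assumed to be the average per-position character probability, and the names `build_profile`, `score_profile`, `edit_distance`, and `approximate_match` are hypothetical.

```python
from collections import Counter


def build_profile(cluster):
    """Profile for one cluster: profile[pos][ch] is the fraction of
    strings in the cluster that have character ch at position pos."""
    length = max(len(s) for s in cluster)
    profile = []
    for pos in range(length):
        counts = Counter(s[pos] for s in cluster if pos < len(s))
        profile.append({ch: n / len(cluster) for ch, n in counts.items()})
    return profile


def score_profile(query, profile):
    """Assumed scoring rule: sum the probabilities of the query's
    characters at their positions, normalised by the longer length."""
    total = sum(
        profile[pos].get(ch, 0.0)
        for pos, ch in enumerate(query)
        if pos < len(profile)
    )
    return total / max(len(query), len(profile))


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def approximate_match(query, clusters, top_k=1, max_dist=1):
    """Score the query against every profile, keep the top_k clusters,
    and run edit distance only against entities in those clusters."""
    profiles = [build_profile(c) for c in clusters]
    ranked = sorted(
        range(len(clusters)),
        key=lambda i: score_profile(query, profiles[i]),
        reverse=True,
    )
    matches = []
    for i in ranked[:top_k]:
        for entity in clusters[i]:
            if edit_distance(query, entity) <= max_dist:
                matches.append(entity)
    return matches


# Toy dictionary, already grouped into clusters (illustrative data only).
clusters = [
    ["aspirin", "aspirine"],
    ["ibuprofen", "ibuprofin"],
    ["paracetamol"],
]
print(approximate_match("asprin", clusters, top_k=1, max_dist=1))
```

In this toy run the misspelled query is compared against three profiles and then against the two entities of a single cluster, rather than against all five dictionary entries. In practice the profiles would be built once and reused across queries; they are rebuilt on each call here only to keep the sketch short.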
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. The 18th International Conference on Management of Data (COMAD), 14th-16th Dec, 2012 at Pune, India. Copyright ©2012 Computer Society of India (CSI).
Similar resources
Simple and Efficient Algorithm for Approximate Dictionary Matching
This paper presents a simple and efficient algorithm for approximate dictionary matching designed for similarity measures such as cosine, Dice, Jaccard, and overlap coefficients. We propose this algorithm, called CPMerge, for the τ-overlap join of inverted lists. First we show that this task is solvable exactly by a τ-overlap join. Given inverted lists retrieved for a query, the algorithm coll...
Approximate Matching of Hand-Drawn Pictograms
We describe a new naming paradigm for pen-based computing which we call Pictographic Naming. Using our approach, traditional character-by-character handwriting recognition (HWX) is avoided. Instead, we use a combination of user interface conventions and approximate matching techniques. Since pictographic names incorporate pen-stroke data, they can never be reproduced exactly, so name lookup bec...
Factorization of Overlapping Harmonic Sounds Using Approximate Matching Pursuit
Factorization of polyphonic musical signals remains a difficult problem due to the presence of overlapping harmonics. Existing dictionary learning methods cannot guarantee that the learned dictionary atoms are semantically meaningful. In this paper, we explore the factorization of harmonic musical signals when a fixed dictionary of harmonic sounds is already present. We propose a method called ...
Real-time scalable object detection
Real-time, scalable, multi-view object detection is an active area of research, particularly in the robot vision community. An efficient template-based object detection algorithm has recently been proposed [1] that utilizes both color and depth information, and works on texture-less objects. However, the approach scales linearly with the number of objects and views. This project explores a variatio...
Approximate string matching algorithms for limited-vocabulary OCR output correction
Five methods for matching words mistranslated by optical character recognition to their most likely match in a reference dictionary were tested on data from the archives of the National Library of Medicine. The methods, including an adaptation of the cross correlation algorithm, the generic edit distance algorithm, the edit distance algorithm with a probabilistic substitution matrix, Bayesian a...
Publication date: 2012